Wednesday 7/6/22
# For this demonstration, I will use the Team Standard Batting table; you will use the dataset
# that your team chooses
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
requests to Load the Webpage into Python
The URL of this page should be: https://www.baseball-reference.com/leagues/majors/2021.shtml
team_stats_Request = requests.get('https://www.baseball-reference.com/leagues/majors/2021.shtml')
# use requests to get the website
type(team_stats_Request)
requests.models.Response
print(team_stats_Request) # as you can see, you can't do much with a raw Response object
<Response [200]>
Note that the URL contains 2021. To reach the web pages for every year from 2000 to 2021, write the URL as an f-string and loop over the years so that the year in the URL changes with each iteration.
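As a minimal sketch of that idea, the snippet below only builds the URLs and does not download anything. The helper name `season_urls` is made up for this illustration; the URL pattern is the one shown above with the year swapped out.

```python
def season_urls(first_year=2000, last_year=2021):
    """Return a dict mapping each season year to its league page URL."""
    return {
        # The f-string inserts the current year into the URL.
        year: f"https://www.baseball-reference.com/leagues/majors/{year}.shtml"
        for year in range(first_year, last_year + 1)  # last_year is included
    }

urls = season_urls()
print(len(urls))   # number of seasons covered
print(urls[2021])  # the page used in this demonstration
```

Keeping the year as the dictionary key means it is still on hand later, when each season's table needs to be labeled with its year.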
BeautifulSoup Object
team_stats_soup = BeautifulSoup(team_stats_Request.text, 'html.parser') # naming the parser avoids a bs4 warning
type(team_stats_soup) # the object we created with BeautifulSoup() is an object of type BeautifulSoup
bs4.BeautifulSoup
Open the SelectorGadget link, and a small toolbar should appear at the bottom of your screen. Every time you hover over a part of the website, a box should appear around that part.
Keep clicking on either (1) unhighlighted parts you do want or (2) highlighted parts you don't want until only what you want is selected.
As you can see, only the batting table is selected.
Now, copy the nodes (CSS selectors) that SelectorGadget gives you.
For this table, the nodes are: #teams_standard_batting .left, #teams_standard_batting .right, #teams_standard_batting .center
select() method on a BeautifulSoup Object with SelectorGadget to Access Website Text
hittingTable = team_stats_soup.select(
    '#teams_standard_batting .center, #teams_standard_batting .left, #teams_standard_batting .right'
) # pass the nodes to select() as a single string
The object hittingTable is an iterable containing each cell of the table. To access the information in each cell, you must loop over it and read each element's text attribute. For simplicity, I will use the first 3 elements of hittingTable to demonstrate.
hittingTable[0:3] # as you can see, directly accessing the object is not useful
[<th aria-label="Tm" class="poptip sort_default_asc left" data-stat="team_name" scope="col">Tm</th>, <th aria-label="#Bat" class="poptip center" data-stat="batters_used" data-tip="<strong>Number of Players used in Games</strong>" scope="col">#Bat</th>, <th aria-label="BatAge" class="poptip sort_default_asc center" data-stat="age_bat" data-tip="<strong>Batters&#x2019; average age</strong><br>Weighted by AB + Games Played" scope="col">BatAge</th>]
hitting_table_elements = []
for element in hittingTable[0:3]:
    hitting_table_elements.append(element.text)
print(hitting_table_elements) # creating a loop and iterating through the elements is the more useful route
['Tm', '#Bat', 'BatAge']
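Because the cells come back as one flat list, they have to be grouped into rows before they can become a table. The sketch below shows one way to do that, assuming every row has the same number of cells. The helper `cells_to_rows` and the sample values are placeholders for illustration only; the real table has many more columns and may include summary rows that you will want to drop.

```python
def cells_to_rows(cell_texts, n_columns):
    """Group a flat list of cell strings into rows of n_columns cells each."""
    if n_columns <= 0:
        raise ValueError("n_columns must be positive")
    if len(cell_texts) % n_columns != 0:
        raise ValueError("cell count is not a multiple of n_columns")
    # Step through the list one row's worth of cells at a time.
    return [cell_texts[i:i + n_columns]
            for i in range(0, len(cell_texts), n_columns)]

# Placeholder cells: a 3-column header followed by two made-up teams.
flat_cells = ['Tm', '#Bat', 'BatAge',
              'Team A', '50', '28.0',
              'Team B', '55', '27.5']

rows = cells_to_rows(flat_cells, 3)
header, body = rows[0], rows[1:]  # first row holds the column names
print(header)
print(body)
```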
Using the data that can be accessed from the BeautifulSoup object as shown above, iterate through the table for each year from 2000 to 2021.
Your final product should be a pandas DataFrame with every team's stats from every year between 2000 and 2021, plus the year of each team (which is not in the Baseball Reference tables; you will have to add this in yourself).
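One possible way to add the year and stack the seasons is sketched below. It assumes each season's table has already been turned into a list of rows; the helper `season_frame`, the column name `Year`, and all of the values are placeholders for illustration, not real stats.

```python
import pandas as pd

def season_frame(header, body, year):
    """Build one season's DataFrame and add the year as the first column."""
    df = pd.DataFrame(body, columns=header)
    df.insert(0, "Year", year)  # the same year is repeated on every row
    return df

header = ["Tm", "#Bat", "BatAge"]

# Placeholder rows for two seasons; real rows would come from the scraped pages.
tables_by_year = {
    2000: [["Team A", "50", "28.0"], ["Team B", "55", "27.5"]],
    2001: [["Team A", "48", "28.4"], ["Team B", "51", "27.9"]],
}

frames = [season_frame(header, body, year)
          for year, body in tables_by_year.items()]

# Stack the seasons on top of each other and renumber the rows.
all_seasons = pd.concat(frames, ignore_index=True)
print(all_seasons)
```

Note that scraped cell text arrives as strings, so numeric columns would still need converting (for example with `pd.to_numeric`) before any analysis.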
Here are two important guidelines to follow so that your IP address does not get banned:
| Hitting | Pitching | Fielding |
|---|---|---|
| Arnav | Anish | Nicole |
| Victor | Aaron | Maddie |
| Avnish | Vince | |